fix[next]: Fix segfault for nanobind >=2.10 #2431

Open
tehrengruber wants to merge 7 commits into GridTools:main from tehrengruber:fix_nanobind_segfault

@tehrengruber tehrengruber commented Jan 10, 2026

Temporary fix for a segfault that occurs with at least nanobind 2.10.2. It is not yet clear which environment conditions trigger it, but at first sight this looks like a nanobind issue. I could reproduce the error with Python 3.10 and 3.12.

Error message:

Fatal Python error: Segmentation fault

Current thread 0x00007d4596caeb80 (most recent call first):
  File "/home/tille/Development/gt4py_functional/src/gt4py/next/program_processors/runners/gtfn.py", line 67 in decorated_program
  File "/home/tille/Development/gt4py_functional/src/gt4py/next/backend.py", line 161 in __call__
  File "/home/tille/Development/gt4py_functional/src/gt4py/next/ffront/decorator.py", line 724 in __call__

Source of the error:
This innocent-looking line in src/gt4py/next/otf/compilation/compiler.py triggers the error:

getattr(importer.import_from_path(src_dir / new_data.module), new_data.entry_point_name)

After the getattr call, the module returned by importer.import_from_path(src_dir / new_data.module) is no longer referenced and is garbage collected, which results in a call to nanobind::detail::nb_module_clear. For reasons that are still unknown, this function garbage collects the value stored in static_pyobjects[pyobj_name::dl_version_tpl], which nanobind/include/nanobind/ndarray.h uses when the compiled program is called in src/gt4py/next/program_processors/runners/gtfn.py.
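
The lifetime problem can be illustrated in pure Python. This is only a minimal sketch: `fake_extension` and `entry_point` are hypothetical stand-ins for the compiled nanobind module and its entry point, and no nanobind code is involved. It shows that an object fetched with getattr does not necessarily keep the module it came from alive.

```python
import gc
import types
import weakref


def module_collected_while_function_lives():
    # Hypothetical stand-in for the imported extension module.
    mod = types.ModuleType("fake_extension")
    mod.entry_point = lambda: "result"
    mod_ref = weakref.ref(mod)

    # Same pattern as the line above: only the attribute is kept,
    # the module object itself is dropped right away.
    func = getattr(mod, "entry_point")
    del mod
    gc.collect()

    # The function is still usable, but the module may already be gone.
    return func, mod_ref() is None
```

In this sketch the function keeps working after the module is collected. With nanobind the situation is worse, because clearing the module also releases state that later calls depend on.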

Steps to debug:

  1. Make sure you have Python with debug symbols and the Python extensions for gdb installed.
  2. Run the following commands in gdb:

     set args -m pytest tests/next_tests/integration_tests/feature_tests/ffront_tests/test_execution.py -k test_copy
     break init_pyobjects
     run
     # stop when `static_pyobjects[pyobj_name::dl_version_tpl]` changes
     watch static_pyobjects[pyobj_name::dl_version_tpl]
     # ignore the first change, which is the initial assignment in init_pyobjects
     continue
     # now you should land in `nanobind::detail::nb_module_clear`; printing
     # a Python backtrace reveals the line in which the problem occurs
     py-bt

The proper solution is likely for each nanobind function to keep a reference to its module.

Update: we opened an issue in the nanobind repo reporting this bug: wjakob/nanobind#1283

@tehrengruber tehrengruber marked this pull request as draft January 10, 2026 09:46
@tehrengruber (Contributor, Author) commented

Others also seem to be running into this issue unless they install with pinned versions, e.g. from uv.lock. @egparedes @havogt I would prefer not to invest more time right now in investigating the issue in nanobind and potentially fixing it there. Should we either a) merge this (with an improved comment) or b) restrict nanobind to 2.9 for the time being?

@egparedes egparedes changed the title fix[next]: Fix nanobind segfault fix[next]: Fix segfault for nanobind >=2.10 Feb 4, 2026
@egparedes egparedes marked this pull request as ready for review February 4, 2026 13:14
Copilot AI review requested due to automatic review settings February 4, 2026 13:14
Copilot AI left a comment

Pull request overview

This pull request implements a workaround for a segfault issue that occurs with nanobind version 2.10 and later. The root cause is that nanobind extension modules are being garbage collected while their functions are still in use, leading to crashes when calling those functions with ndarray arguments. The PR addresses this by creating a wrapper class that holds references to both the module and the function, preventing premature garbage collection.

Changes:

  • Added import of cast from typing to support type casting
  • Modified the compilation process to wrap the compiled function in a dynamically created class that maintains references to the extension module
  • Added detailed comments explaining the workaround and linking to the upstream nanobind issue
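
The idea behind the workaround can be sketched as follows. This is a hedged illustration, not the code from this PR: the name `ModuleBoundCallable` is hypothetical, and it uses a plain class where the PR creates the class dynamically. The point is only that the returned callable holds a strong reference to the module for as long as the function is in use.

```python
import types
from typing import Any


class ModuleBoundCallable:
    """Callable that keeps its originating module alive (hypothetical helper)."""

    def __init__(self, module: types.ModuleType, entry_point_name: str) -> None:
        # Strong reference to the module: it cannot be collected while
        # this wrapper is reachable.
        self._module = module
        self._func = getattr(module, entry_point_name)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Forward the call unchanged to the wrapped entry point.
        return self._func(*args, **kwargs)
```

A caller would then keep `ModuleBoundCallable(module, name)` instead of the bare result of `getattr(module, name)`. Once the wrapper itself is dropped, the module can be collected as usual.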

@egparedes egparedes left a comment

LGTM